This data set contains information about the quality of Vinho Verde white wines from the Minho region of Portugal. It contains objective measurements of 11 chemical attributes of 4898 different white wines, along with a quality score. The data set is tidy and complete, i.e. no entries are missing.
The quality score is the median value from an expert jury of at least three judges. It would be very interesting to also have the scores of the individual judges, so I could see the variance in scores. Unfortunately only the aggregate value is provided.
Quality takes a numerical value, but it is effectively an ordered factor (and is treated accordingly). According to the dataset description, possible values for quality range from 0 (very bad) to 10 (very excellent). The academic paper [1] referencing the dataset suggests that the experts judged the wines on this numeric scale (as opposed to a qualitative scale later mapped to numbers).
Quality and Rating are the only two factor variables. The remaining variables are numerical.
According to the dataset description numerical variables can be divided into:
The value range for all variables is quite narrow and similar for all ratings. The distributions don’t have a pronounced long tail, unlike diamond prices or Facebook friend counts.
Let me start by splitting the histograms of alcohol and pH by rating. I will set the scales to ‘free_y’ so that the less frequent ratings are not flattened.
The shape of the alcohol histogram is very interesting. It doesn’t look much like a normal distribution; it has multiple peaks.
Interestingly, when I split the histograms by rating, I noticed strongly complementary trends in lower and higher rated wines. Many bad and medium wines have a lower alcohol content, whereas many good and excellent wines have a high alcohol content.
The histogram of the overall distribution of pH resembles a bell curve very closely. It is slightly right skewed, with a bit of a long tail to the right.
Such trends are even more pronounced in most of the other variables. Chlorides are a good example.
There are also a couple of exceptions to these general trends: the already discussed alcohol, residual.sugar and sulphates.
I investigated all the variables with the function below. For most variables the overall histograms showed similar trends and a (seemingly) normal distribution: they resemble a bell curve closely, slightly right skewed with a bit of a long tail to the right. The exceptions to this trend are alcohol and, to a lesser extent, residual.sugar and sulphates.
As noted above, the alcohol histogram has multiple peaks and doesn’t look much like a normal distribution. Residual.sugar and sulphates merely add some spikes to the base curve.
On the other hand, the boxplots and summaries of some variables revealed interesting trends.
The boxplots and summaries show that medium wines (by rating) typically exhibit the widest variety of values (and the most outliers). Since medium is the most frequent category, this is not surprising. Conversely, excellent wines are rare and generally have the lowest variety.
In this analysis I am mostly interested in the general trends for the particular ratings, not in the outliers of individual variables. The analysis focuses on comparing the characteristics of wines grouped by their rating.
Notes:
Alcohol - similar trends but bad and medium wines are skewed towards lower alcohol content
pH - very similar trends, no outliers for excellent wines
volatile.acidity - higher for bad wines (check summary)
chlorides - higher for bad wines (check summary), very cool strong trend 1,2,3Q + mean drops with rating
free.sulfur.dioxide - could be used to prune bad wines? (check summary)
free/total.sulfur - could be used to prune bad wines, very cool strong trend 1,2,3Q + mean grows with rating
density - higher for bad wines (check summary)
alcohol - lower for bad wines
Some of the variables considered below revealed interesting trends, namely chlorides, the ratio between free.sulfur.dioxide and total.sulfur.dioxide, and density. These trends can be spotted in the boxplots and become clearly visible in the statistical summaries.
This plot suggests that the median for good and excellent wines is no higher than the first quartile for medium and bad wines. This means that good wines are statistically more likely to have a lower value of chlorides than bad ones. The summary confirms this.
## wines_w$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000
## --------------------------------------------------------
## wines_w$rating: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03700 0.04400 0.04774 0.05100 0.34600
## --------------------------------------------------------
## wines_w$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## wines_w$rating: excelent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
So if I discarded all wines with chlorides above 0.03700 (the median for good wines), I would discard the majority (about three quarters) of bad and medium wines and still keep at least about half of the good and excellent ones.
If I decide to focus exclusively on excellent wines, I can go even tighter and remove wines with chlorides above 0.0355 (the median for excellent wines).
Another observation is that the value of every quartile and of the mean decreases with rating. So there is a strong trend there.
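The reasoning above (summarise by rating, pick the median of the good wines as a cut-off, see how much of each rating survives) can be sketched in a few lines of base R. This is only an illustration: `wines_toy` and its values are invented stand-ins for `wines_w`, and the `by()` call is my assumption about how the summaries above were printed.

```r
# Hypothetical toy data standing in for wines_w; the values are invented.
set.seed(1)
wines_toy <- data.frame(
  rating = factor(rep(c("bad", "medium", "good", "excelent"), each = 50),
                  levels = c("bad", "medium", "good", "excelent"),
                  ordered = TRUE),  # level spelling follows the printed output
  chlorides = c(rnorm(50, 0.050, 0.010), rnorm(50, 0.047, 0.010),
                rnorm(50, 0.038, 0.008), rnorm(50, 0.037, 0.008))
)

# Per-rating summaries (min, quartiles, median, mean, max).
by(wines_toy$chlorides, wines_toy$rating, summary)

# Cut-off: the median chlorides value of the good wines.
cutoff <- median(wines_toy$chlorides[wines_toy$rating == "good"])

# Share of each rating that survives the cut (chlorides at or below cutoff).
kept_share <- tapply(wines_toy$chlorides <= cutoff, wines_toy$rating, mean)
print(round(kept_share, 2))
```

By construction half of the good wines survive such a cut; the interesting part is how much smaller the surviving share of the bad and medium wines is.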
For free.sulfur.dioxide the boxplot hints at a trend that would allow me to discard some bad wines.
Interestingly, the ratio of free to total sulfur dioxide reveals an even clearer trend. This is true even though total sulfur dioxide does not show a strong trend on its own.
Once again I validate this insight by inspecting the statistical summary.
## wines_w$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03371 0.10540 0.16130 0.18880 0.23850 0.65680
## --------------------------------------------------------
## wines_w$rating: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02362 0.18810 0.25000 0.25240 0.31120 0.71050
## --------------------------------------------------------
## wines_w$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.2118 0.2717 0.2757 0.3333 0.6429
## --------------------------------------------------------
## wines_w$rating: excelent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07895 0.22310 0.28770 0.28930 0.33620 0.60380
In this case the median of good wines is above the 3rd quartile of bad wines. This allows me to filter out a large majority of bad wines.
As with chlorides, all quartiles and the mean follow a clear (this time growing) trend with the rating.
I did the same analysis for density.
## wines_w$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9960 1.0000
## --------------------------------------------------------
## wines_w$rating: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9923 0.9944 0.9945 0.9966 1.0390
## --------------------------------------------------------
## wines_w$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0000
## --------------------------------------------------------
## wines_w$rating: excelent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0010
This time the median for good and excellent wines is below the first quartile for medium and bad ones. So I can again discard some medium and bad wines, as with chlorides.
In this case the quartiles and mean do not follow a clear trend with respect to rating.
## wines_w$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.10 10.17 10.80 13.50
## --------------------------------------------------------
## wines_w$rating: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.00 10.27 11.00 14.00
## --------------------------------------------------------
## wines_w$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wines_w$rating: excelent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
Alcohol content almost follows a growing trend for the quartiles and mean with respect to rating. I decided not to filter out wines with low alcohol content, since alcohol content is typically known to the consumer (unlike the other investigated variables). Hence removing wines with low alcohol volume may reduce the choices for consumers with this preference.
Based on the above observations I have created a filtered dataset. I kept wines with the following properties:
The filtered data set contains 374 wines, which represent 7.63% of the original dataset of 4898 wines.
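A minimal sketch of such a filter in base R follows. The exact cut-offs are not restated here, so the three thresholds in the code are assumptions: I simply reuse the medians of the good wines from the summaries above. `wines_toy` is an invented five-row stand-in for the real data.

```r
# Hypothetical toy rows; column names follow the text, values are invented.
wines_toy <- data.frame(
  chlorides            = c(0.030, 0.045, 0.036, 0.050, 0.033),
  free.sulfur.dioxide  = c(40, 20, 35, 15, 45),
  total.sulfur.dioxide = c(120, 150, 110, 170, 130),
  density              = c(0.9910, 0.9950, 0.9915, 0.9960, 0.9930),
  quality              = c(7, 5, 8, 4, 6)
)
wines_toy$ratio.sulfur.dioxide <-
  wines_toy$free.sulfur.dioxide / wines_toy$total.sulfur.dioxide

# Assumed cut-offs: the medians of the good wines in the summaries above.
max_chlorides <- 0.037
min_ratio     <- 0.2717
max_density   <- 0.9918

# Keep only rows that pass all three conditions at once.
filtered_toy <- subset(
  wines_toy,
  chlorides <= max_chlorides &
    ratio.sulfur.dioxide >= min_ratio &
    density <= max_density
)

# Share of the original rows that survive the filter, in percent.
100 * nrow(filtered_toy) / nrow(wines_toy)
```

In the toy data only the rows with quality 7 and 8 pass all three conditions; the row with quality 6 passes the chlorides and ratio checks but fails on density.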
More importantly, the distribution of the rating changed significantly, as is apparent from this grid plot. The dashed shape in the background shows the barplot with all the wines (before filtering).
Distribution of wines before filtering
##
## 3 4 5 6 7 8
## 0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869
## 9
## 0.001020825
Distribution of wines after filtering
##
## 4 5 6 7 8 9
## 0.005347594 0.053475936 0.403743316 0.443850267 0.085561497 0.008021390
The original distribution is in the top left bar plot and the filtered one is in the bottom left one. The share of < 6 wines (short for wines with quality below 6) is much smaller. Wines with the lowest quality 3 were completely removed. The most frequent quality is now 7 (44.4%), whereas originally it was 6 (44.9%).
More than 98.5% of < 6 wines were removed. 93% of 6’s were removed, in contrast to only 81% / 82% of 7’s / 8’s and only 40% of 9’s.
Therefore there is now a higher share of wines above 6 (good and excellent, 53.74%) than below 7 (medium and bad, 46.26%).
Because there is now a similar number of 6’s and 7’s left, it will be easier to spot differences and commonalities between the two groups.
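The removal percentages can be recomputed from the counts. In the sketch below the "before" counts are the ones printed by `table(wines_w$quality)`, while the "after" counts are reconstructed by multiplying the printed proportions by 374, so they are derived rather than read from the data.

```r
qualities <- as.character(3:9)

# Counts per quality before filtering.
before <- c(20, 163, 1457, 2198, 880, 175, 5)
names(before) <- qualities

# Counts after filtering, reconstructed as proportion * 374.
after <- c(0, 2, 20, 151, 166, 32, 3)
names(after) <- qualities

# Percentage of wines removed for each quality.
removed_pct <- 100 * (before - after) / before
print(round(removed_pct, 1))

# Wines below quality 6 taken together.
low <- c("3", "4", "5")
removed_low_pct <- 100 * (1 - sum(after[low]) / sum(before[low]))
print(round(removed_low_pct, 2))

# Share of good and excellent wines (quality 7 to 9) after filtering.
good_share_pct <- 100 * sum(after[c("7", "8", "9")]) / sum(after)
print(round(good_share_pct, 2))
```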
Next I re-examined the variables, this time considering only the filtered wines. I was curious whether new trends had emerged. This is indeed the case. The most pronounced trend can be observed for residual sugar.
## filtered_wines_w$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.20 1.55 1.90 1.90 2.25 2.60
## --------------------------------------------------------
## filtered_wines_w$rating: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.100 1.600 2.241 2.600 10.800
## --------------------------------------------------------
## filtered_wines_w$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.600 2.400 3.038 4.212 9.700
## --------------------------------------------------------
## filtered_wines_w$rating: excelent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.700 4.200 3.671 4.975 8.300
Once again I could remove the majority (at least 75%) of bad and medium wines while keeping more than half of the excellent ones (their median is 4.2) and a little under half of the good ones (median 2.4). To do this, I would keep only wines with residual sugar > 2.6 (3Q for medium wines, blue dotted line).
In the first round of filtering the aim was to remove the bulk of bad and medium wines. Now I have more control. I could also use residual sugar to separate good and excellent wines by keeping only wines with residual sugar above 4.212 (3Q for good wines, green dotted line).
Besides residual sugar, I could keep only wines with total.sulfur.dioxide > 113.8 to remove the majority of bad and medium wines.
Or I could use citric.acid > 0.3375 to separate good and excellent wines (after removing the medium wines through residual sugar or total sulfur dioxide).
Alcohol remains a promising candidate for removing most bad and medium wines while keeping at least half of the excellent ones.
In summary I have many more options for additional filtering:
The trends in the filtered wines appear much clearer. They allow me not only to filter out medium wines but even to shift the proportion between good and excellent ones.
However, I was not able to completely remove bad wines with the simple pruning described above. If I applied all the filtering steps simultaneously, there would be only 9 wines left: 5 bad, 2 medium and 2 excellent. But if I skip the citric acid step, I will have 5 bad, 11 medium and 6 excellent. The latter distribution seems much more promising.
Next I would like to investigate how much filtering reduces the variety of wines. I will do that by drawing bi- and multivariate plots.
There is one dependent variable, quality, describing the quality of the wine. The remaining variables are results of various chemical measurements. Most of these variables seem to follow a roughly normal distribution.
The main feature of interest is clearly the quality of the wine. This variable is effectively a ranking ranging from 0 (very bad) to 10 (very excellent). This dataset contains only wines with quality between 3 and 9. However, there are only very few wines of quality 3 and 9, and most wines have quality 6, 5 or 7.
table(wines_w$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The interesting question is how quality is determined by the other variables.
I have identified a number of variables (and their critical values) which allow me to distinguish wines >= 7 (quality 7 is good, 8 and 9 is excellent) from medium (5 and 6) and bad wines (below 5).
I found the variables and values by inspecting the boxplots and looking at the statistical summaries. By removing wines not meeting those characteristics (except alcohol) I have obtained a much smaller dataset of 374 wines (7.63% of total). In the filtered dataset the majority of wines are good or excellent (44.39% and 9.36%). These proportions were much smaller in the original dataset (17.97% and 3.67%). This shows that the majority of bad and medium wines were removed.
Repeating the analysis for filtered wines, I identified more options for filtering:
Experimenting with these additional steps I ended up with only 17 wines: 5 bad, 11 medium and 6 excellent.
Yes. I created an ordered factor rating, which is a simplified version of quality that I have used for plotting. The possible values are bad (quality 3,4), medium (quality 5,6), good (quality 7) and excellent (quality 8,9).
I introduced ratio.sulfur.dioxide, which is a numeric variable whose value is free.sulfur.dioxide / total.sulfur.dioxide.
I also introduced an auxiliary variable filtering to draw before/after boxplots.
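The two derived variables can be built roughly as follows. This is a sketch on an invented data frame `wines_toy`, not the original chunk; the use of `cut()` is my assumption, and the level label "excelent" is spelled as it appears in the printed summaries.

```r
# Hypothetical toy rows; column names follow the text, values are invented.
wines_toy <- data.frame(
  quality              = c(3, 4, 5, 6, 7, 8, 9),
  free.sulfur.dioxide  = c(10, 20, 30, 40, 45, 50, 30),
  total.sulfur.dioxide = c(100, 160, 150, 160, 150, 125, 120)
)

# Ordered factor: 3-4 bad, 5-6 medium, 7 good, 8-9 excellent.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,7], (7,9].
wines_toy$rating <- cut(
  wines_toy$quality,
  breaks = c(2, 4, 6, 7, 9),
  labels = c("bad", "medium", "good", "excelent"),
  ordered_result = TRUE
)

# Share of the total sulfur dioxide that is free.
wines_toy$ratio.sulfur.dioxide <-
  wines_toy$free.sulfur.dioxide / wines_toy$total.sulfur.dioxide

table(wines_toy$rating)
```

Making the factor ordered keeps the ratings in their natural order in summaries and plots, instead of the alphabetical order R would otherwise use.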
I have plotted the various variables (e.g. sulphates) against quality in scatterplots. To remove outliers I have plotted only values which fell into the interval [median - 2*IQR, median + 2*IQR] (for a symmetric distribution this matches the whiskers of a box plot, which reach 1.5*IQR beyond the quartiles). I also plotted some statistical summaries among the data points: the mean (red dot) and the 1st (yellow), 2nd / median (orange) and 3rd quartile (brown). The data points were coloured according to their rating.
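The outlier rule can be written as a small helper. The function name `within_iqr_band` and the sample values are hypothetical; only the interval itself comes from the description above.

```r
# TRUE for values inside [median - k * IQR, median + k * IQR].
within_iqr_band <- function(x, k = 2) {
  centre <- median(x, na.rm = TRUE)
  spread <- IQR(x, na.rm = TRUE)
  !is.na(x) & x >= centre - k * spread & x <= centre + k * spread
}

# Invented sulphates-like values with one obvious outlier at the end.
sulphates_toy <- c(0.40, 0.44, 0.47, 0.50, 0.53, 0.56, 0.60, 1.08)

keep <- within_iqr_band(sulphates_toy)
sulphates_toy[keep]
```

The logical vector `keep` can be used to subset a whole data frame, so the same rows are dropped from the points and from the plotted summaries.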
In general the scatterplots revealed more structure but mostly confirmed the observations from the boxplots. Take sulphates, for example.
The majority of wines are of quality 5 to 7. There appears to be no clear pattern linking sulphates to quality (and therefore rating). This is the case for most variables, apart from a few exceptions.
The exceptions are the same I have identified from boxplots:
There is one more interesting case: volatile acidity. In hindsight the trend was apparent in the boxplot/summary as well. But when I looked at the boxplot I was focused on discriminating bad and medium wines from good and excellent ones (as opposed to removing only the bad ones).
Anyhow, adding volatile acidity as a filter would not improve the filtering. Actually it would make things much worse. Removing wines through filtering is a very delicate trial and error procedure. One has to be careful not to remove too many wines.
I have also calculated Pearson’s R between quality and the other variables. Interestingly, the four variables with the strongest correlation (in absolute terms) are the same ones I found through the box plot investigation:
## alcohol density chlorides
## 0.4355747 -0.3071233 -0.2099344
## ratio.sulfur.dioxide volatile.acidity total.sulfur.dioxide
## 0.1972141 -0.1947230 -0.1747372
Even the +/- sign of the correlation agrees with the direction of filtering (above or below a given value). This might be a coincidence. Except for alcohol (and perhaps density) the correlations are quite weak.
On the other hand, the aim of filtering is to remove low quality wines, which dominate the distribution. Therefore even variables with a relatively low correlation with quality could be good candidates for filtering.
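A ranking like the one printed above can be obtained by correlating every variable with quality and sorting by absolute value. The sketch below runs on simulated data (`wines_toy` and all coefficients are invented), so the numbers it prints have nothing to do with the real wines.

```r
# Simulated stand-in for the wine data; all relationships are invented.
set.seed(42)
n <- 200
alcohol   <- rnorm(n, 10.5, 1.2)
density   <- 1.02 - 0.0025 * alcohol + rnorm(n, 0, 0.001)
chlorides <- rnorm(n, 0.045, 0.01)
quality   <- round(6 + 0.5 * (alcohol - 10.5) + rnorm(n, 0, 0.8))
wines_toy <- data.frame(alcohol, density, chlorides, quality)

# Pearson's R (the default method of cor) between quality and each variable.
predictors <- setdiff(names(wines_toy), "quality")
r <- sapply(wines_toy[predictors], function(x) cor(x, wines_toy$quality))

# Sort by absolute strength while keeping the sign.
r_sorted <- r[order(abs(r), decreasing = TRUE)]
print(round(r_sorted, 3))
```

Sorting on `abs(r)` rather than `r` matters here, because strongly negative correlations (such as density) are as useful for filtering as strongly positive ones.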
I also looked at the correlation between these variables:
## volatile.acidity chlorides total.sulfur.dioxide
## volatile.acidity 1.00000000 0.07051157 0.08926050
## chlorides 0.07051157 1.00000000 0.19891030
## total.sulfur.dioxide 0.08926050 0.19891030 1.00000000
## density 0.02711385 0.25721132 0.52988132
## alcohol 0.06771794 -0.36018871 -0.44889210
## ratio.sulfur.dioxide -0.19616085 -0.03321768 -0.01344785
## density alcohol ratio.sulfur.dioxide
## volatile.acidity 0.02711385 0.06771794 -0.19616085
## chlorides 0.25721132 -0.36018871 -0.03321768
## total.sulfur.dioxide 0.52988132 -0.44889210 -0.01344785
## density 1.00000000 -0.78013762 -0.06552475
## alcohol -0.78013762 1.00000000 0.06446642
## ratio.sulfur.dioxide -0.06552475 0.06446642 1.00000000
Alcohol and density show a very strong correlation. Density is moderately correlated with total.sulfur.dioxide. The correlation between alcohol and total.sulfur.dioxide is a bit weaker.
I have plotted variables against quality and mostly confirmed the insight from box plots.
What I found quite interesting is that the correlation coefficients (with wine quality) quite clearly pointed to the same variables I identified via the box plot analysis. This makes sense for variables showing a strong correlation. But even variables with a weak correlation (below 0.2) can be used for filtering.
The correlation coefficients allowed me to judge the strength of correlations. Alcohol has the strongest correlation with quality (0.44). Density comes next with -0.31.
Alcohol and density show a very strong correlation (-0.78). Density is moderately correlated with total.sulfur.dioxide (0.53). The correlation between alcohol and total.sulfur.dioxide is a bit weaker (-0.45).
The negative correlation between alcohol and density is clearly the strongest one.
I started by plotting the variable pairings identified in the previous section. I have removed the outliers (lying outside of the IQR) and scaled the axes to focus on the majority of data points. I have also customized the alpha value, decreasing alpha for medium wines (which form the overwhelming majority) and increasing it for bad and excellent wines (which are rare).
The plots confirm trends I already know, i.e. wines with high alcohol and low density tend to have higher quality. However, bad quality wines are scattered all over the plot.
The plots are still very noisy and dominated by medium wines (green). Next I decided to focus only on filtered wines. I will repeat the same analysis as before. First I calculate Pearson’s R between quality and other variables, this time considering only the filtered wines.
## alcohol residual.sugar total.sulfur.dioxide
## 0.26456153 0.25614075 0.20561321
## pH free.sulfur.dioxide sulphates
## 0.13566092 0.12016675 0.10712705
## volatile.acidity ratio.sulfur.dioxide fixed.acidity
## 0.10312750 -0.08281238 -0.04397290
## citric.acid chlorides density
## 0.04242991 -0.02023482 -0.01545611
The resulting list is very different from the one for all wines described previously.
## alcohol density chlorides
## 0.4355747 -0.3071233 -0.2099344
## ratio.sulfur.dioxide volatile.acidity total.sulfur.dioxide
## 0.1972141 -0.1947230 -0.1747372
There are very few similarities in the two lists. In both alcohol shows the strongest correlation with quality. For filtered wines the difference between alcohol and the second best variable (residual sugar) is very small, in contrast to the other list.
Coincidentally, both lists have three entries with a correlation above 0.2 (in absolute terms). But for filtered wines the correlation drops off much more quickly. I start by plotting the top three variables for filtered wines (alcohol, residual.sugar, total.sulfur.dioxide) against each other.
The plot of residual.sugar against total.sulfur.dioxide reveals an area dominated by good and excellent wines (where residual sugar is between 3 and 9). Next I will zoom into this section of the plot. I will also visualize two additional variables through size and shape.
In the plot below, size corresponds to alcohol and shape to pH. I have tried different combinations of variables, but I couldn’t find a combination which reliably discriminates between good, medium and excellent wines. On the contrary, I have established that medium and excellent wines can be very similar (examples plotted against a red background).
Once again I have mostly confirmed trends from my previous investigation.
It was very interesting to see how the importance of features changed when I considered only the filtered wines. In hindsight, it was to be expected. When I remove many wines based on a particular variable, I can expect that this variable will become less important in the remaining wines.
Yet I found the extent of this surprising. Density and chlorides are the features least correlated with quality in the filtered dataset. Prior to filtering, they were the most correlated features (after alcohol).
The plot below nicely shows the difference between the two data sets. The plot for all wines shows a linear tendency, while the plot for filtered wines is more scattered.
It is remarkable how strongly alcohol correlates with quality. I would be curious to know whether this is because poor quality wines have little alcohol, or because a high alcohol content numbs the finer distinctions between the wines.
I was curious whether I could find a simple criterion to reliably tell apart medium, good and excellent wines. I couldn’t find one. It was really interesting to see how similar two wines can be even though they differ significantly in quality.
The authors of [1] applied a number of machine learning techniques and came to a similar conclusion:
In general, the white [wine] data results are better: 60.3/63.3% for classes 6 and 4, 67.8/72.6% for grades 7 and 5, and a surprising 85.5% for the class 8 (the exception are the 3 and 9 extremes with 0%, not shown in the table).
Note that grade 6 wines represent 44.88% of distribution and are detected in 60.3% of cases. Grade 5 wines represent 29.75% and are detected in 72.6%. Grade 7 wines represent 17.97% and are detected in 67.8%.
In summary, for 92.6% of the wines the detection accuracy is roughly 2/3 (about 66% when weighted by class share). This suggests that reliably separating wines by quality is not easy.
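The combined share and the share-weighted accuracy follow directly from the figures quoted above. The vector names `share` and `accuracy` are only illustrative.

```r
# Class shares (% of all white wines) and detection rates (%) as quoted above.
share    <- c("5" = 29.75, "6" = 44.88, "7" = 17.97)
accuracy <- c("5" = 72.6,  "6" = 60.3,  "7" = 67.8)

# Combined share of the three most frequent grades.
total_share <- sum(share)
print(total_share)

# Detection accuracy weighted by how common each grade is.
weighted_accuracy <- weighted.mean(accuracy, share)
print(round(weighted_accuracy, 1))
```

The weighted value comes out at roughly 65.7%, slightly below two thirds, because the most frequent grade (6) is also the one detected least reliably.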
I chose to create no models. The main reason is that quality (the target variable) is a factor variable, albeit expressed on a numeric scale. Ultimately it is a subjective judgement of at least three jurors on a limited scale.
Based on my investigation I am sceptical about finding an objective and universal relationship between the measured variables and the assigned quality. The findings reported in [1] reinforce my scepticism.
I have drawn scatter plots of various variables (e.g. sulphates) against quality to visualize how the distribution of each variable varies with quality. The dots are coloured according to the rating of the wines, which is a variable derived from quality that I use throughout this analysis.
To remove outliers I have plotted only values which fell into the interval [median - 2*IQR, median + 2*IQR] (roughly the whisker range of a box plot). I also plotted some statistical summaries among the data points: the mean (red dot) and the 1st (yellow), 2nd / median (orange) and 3rd quartile (brown).
Both the scatter plots and the statistical summaries suggest that I can draw a line separating the vast majority of bad and medium wines from the rest. The dotted blue cut-off line represents this decision boundary. Conversely, the wines on the other side of the line are much more likely to be good or excellent.
I use such cut-off values for chlorides and density (below the line) and ratio.sulfur.dioxide (above the line) to filter out wines which are bad and medium (and not good or excellent). This way I get a filtered wines dataset. I should also mention that I created ratio.sulfur.dioxide during the analysis (I divided free.sulfur.dioxide by total.sulfur.dioxide).
I do not consider alcohol content for filtering. People can consciously choose wines by alcohol content, which is unlikely for many other chemical properties measured and discussed here (e.g. chlorides).
I chose the plot above to highlight the difference between the original and filtered dataset (described above). I focused on the residual.sugar variable. For the filtered dataset, residual.sugar has the second strongest correlation with wine quality by Pearson’s R (just below alcohol and the difference between them is very small).
The plot shows that filtering has greatly decreased the range of possible values. For residual.sugar in particular, I could easily reiterate the filtering by removing the vast majority of bad and medium wines (below blue dotted line) or focus only on keeping the excellent wines (above the green dashed line).
Apart from removing subpar wines (below good), filtering can also accentuate the differences between the ratings.
This plot shows two areas dominated by good and excellent wines (highlighted by the yellow background). I have only plotted wines from the filtered dataset, which contains a larger proportion of good and excellent wines (above 53%, compared with below 22% for all wines). The filtered dataset is also much smaller and contains only 374 wines (7.63% of total). Thus considering only filtered wines greatly increases the readability of the plot.
I have plotted total.sulfur.dioxide against residual.sugar. These two variables show the highest correlation (by Pearson’s R) with quality for filtered wines, with the exception of alcohol. Alcohol itself is represented by the size of the dots. Shape represents the level of pH.
The high number of features in this plot is intentional. It is meant to illustrate that it is hard to reliably discriminate even between medium and excellent wines. Cases of very similar wines with very different ratings are highlighted by red backgrounds. I have tried different combinations of variables, but I couldn’t find a combination which reliably discriminates between good, medium and excellent wines.
Despite removing the vast majority of wines, the highlighted wines show a good variety in alcohol, pH and residual.sugar.
I have liked this project so much! I have not followed the stream of consciousness/plots strictly. I did so many plots that it would be boring to keep them all in. I have just described which plots I did and kept those which showed something interesting.
I think I have overdone this project by going much deeper than a quick exploration. But I was curious how far one can go using purely visualization. Relying purely on machine learning (as described in [1]) would be a sensible option. But then I wouldn’t have inspected the structure of the data so closely.
Because of the structure of the project I first tried to get the best out of univariate plots before moving on. Perhaps I spent too much time on box plots and filtering because of this. Now I would move more freely between different visualizations as the need arises.
On the other hand it was nice that I rediscovered the same trends through multivariate analysis using Pearson’s R. Pearson’s R was a real delight. In this particular case it has proven a very useful heuristic to identify promising variables. It also allowed me to compare their relative promise.
Of course one has to be careful and always check and verify. Pearson’s R is not guaranteed to give useful, meaningful insights, but then neither are most other statistical methods (except in the rare cases when it can be shown that their formal requirements hold). But with messy real world data, Pearson’s R has proven to be a surprisingly useful rule of thumb for identifying promising variables for trial and error.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.